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Test statistics are often strongly dependent in large-scale multi- 
ple testing applications. Most corrections for multiplicity are unduly 
conservative for correlated test statistics, resulting in a loss of power 
to detect true positives. We show that the Westfall- Young permu- 
tation method has asymptotically optimal power for a broad class 
of testing problems with a block-dependence and sparsity structure 
among the tests, when the number of tests tends to infinity. 

1. Introduction. We consider multiple hypothesis testing where the un- 
derlying tests are dependent. Such testing problems arise in many applica- 
tions, in particular, in the fields of genomics and genome- wide association 
studies [9, 18, 24], but also in astronomy and other fields [21, 26]. Popular 
multiple-testing procedures include the Bonferroni-Holm method [19] which 
strongly controls the family- wise error rate (FWER), and the Benjamini- 
Yekutieli procedure [3] which controls the false discovery rate (FDR), both 
under arbitrary dependence structures between test statistics. If test statis- 
tics are strongly dependent, these procedures have low power to detect true 
positives. The reasons for this loss of power are well known: loosely speaking, 
many strongly dependent test-statistics carry only the information equiva- 
lent to fewer "effective" tests. Hence, instead of correcting among many 
multiple tests, one would in principle only need to correct for the smaller 
number of "effective" tests. Moreover, when controlling some error measure 
of false positives, an oracle would only need to adjust among the tests corre- 
sponding to true negatives. In large-scale sparse multiple testing situations, 



Received June 2011; revised November 2011. 

1 N. Meinshausen and M. H. Maathuis contributed equally to this work. 
AMS 2000 subject classifications. 62F03, 62J15. 

Key words and phrases. Multiple testing under dependence, West fall- Young procedure, 
permutations, familywise error rate, asymptotic optimality, high-dimensional inference, 
sparsity, rank-based nonparametric tests. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Statistics, 
2011, Vol. 39, No. 6, 3369-3391. This reprint differs from the original in 
pagination and typographic detail. 



1 



2 



N. MEINSHAUSEN, M. H. MAATHUIS AND P. BUHLMANN 



this latter issue is usually less important since the number of true positives 
is typically small, and the number of true negatives is close to the overall 
number of tests. 

The dependence among tests can be taken into account by using the 
permutation-based Westfall- Young method [33] , already used widely in prac- 
tice (e.g., [6, 36]). Under the assumption of subset-pivotality (see Section 2.3 
for a definition), this method strongly controls the FWER under any kind 
of dependence structure [32]. 

In this paper we show that the Westfall- Young permutation method is 
an optimal procedure in the following sense. We introduce a single-step or- 
acle multiple testing procedure, by defining a single threshold such that all 
hypotheses with p- values below this threshold are rejected (see Section 2.2). 
The oracle threshold is the largest threshold that still guarantees the desired 
level of the testing procedure. The oracle threshold is unknown in practice 
if the dependence among test statistics and the set of true null hypotheses 
are unknown. We show that the single-step Westfall- Young threshold ap- 
proximates the oracle threshold for a broad class of testing problems with 
a block-dependence and sparsity structure among the tests, when the num- 
ber of tests tends to infinity. Our notion of asymptotic optimality relative to 
an oracle threshold is on a general level and for any specified test statistic. 
The power of a multiple testing procedure depends also on the data gener- 
ating distribution and the chosen individual test(s): we do not discuss this 
aspect here. Instead, our goal is to analyze optimality once the individual 
tests have been specified. 

Our optimality result has an immediate consequence for large-scale multi- 
ple testing: it is not possible to improve on the power of the Westfall- Young 
permutation method while still controlling the FWER when considering 
single-step multiple testing procedures for a large number of tests and as- 
suming only a block-dependence and sparsity structure among the tests 
(and no additional modeling assumptions about the dependence or clus- 
tering/grouping). Hence, in such situations, there is no need to consider 
ad-hoc proposals that are sometimes used in practice, at least when taking 
the viewpoint that multiple testing adjusted p- values should be as model 
free as possible. 

1.1. Related work. There is a small but growing literature on optimal- 
ity in multiple testing under dependence. Sun and Cai [30] studied and 
proposed optimal decision procedures in a two-state hidden Markov model, 
while Genovese et al. [12] and Roeder and Wasserman [28] looked at the 
intriguing possibility of incorporating prior information by p-value weight- 
ing. The effect of correlation between test statistics on the level of FDR 
control was studied in Benjamini and Yekutieli [3] and Benjamini et al. [2]; 



OPTIMALITY OF THE WESTFALL-YOUNG PERMUTATION PROCEDURE 3 



see also Blanchard and Roquain [4] for FDR control under dependence. Fur- 
thermore, Clarke and Hall [7] discuss the effect of dependence and clustering 
when using "wrong" methods based on independence assumptions for con- 
trolling the (generalized) FWER and FDR. The effect of dependence on the 
power of Higher Criticism was examined in Hall and Jin [16, 17]. Another 
viewpoint is given by Efron [10], who proposed a novel empirical choice of an 
appropriate null distribution for large-scale significance testing. We do not 
propose new methodology in this manuscript but study instead the asymp- 
totic optimality of the widely used Westfall- Young permutation method [33] 
for dependent test statistics. 

2. Single-step oracle procedure and the Westfall Young method. Af- 
ter introducing some notation, we define our notion of a single-step oracle 
threshold and describe the Westfall- Young permutation method. 

2.1. Preliminaries and notation. Let W be a data matrix containing n 
independent realizations of an m-dimensional random variable X = (X\ , . . . , 
X m ) with distribution P m and possibly some additional deterministic re- 
sponse variables y. 

Prototype of data matrix W . To make this more concrete, consider the 
following setting that fits the examples described in Section 3.2. Let y be 
a deterministic variable, and allow the distribution of X = X y to depend 
on y. For each value y^\ i = 1, . . . ,n, we observe an independent sample 
x (i) = j _ _ _ ^ j of x = x ^ _ We then define to be an (m + 1) x n- 

dimensional matrix by setting Wi t i = for i = 1, . . . , n and Wj+i,j = Xy 
for j = 1,.. .,m and i = 1, . . . ,n. Thus, the first row of W contains the 
y-variables, and the ith column of W corresponds to the ith data sam- 
ple 

Based on W, we want to test m null hypotheses Hj, j = 1, . . . , m, con- 
cerning the m components X\, . . . ,X m of X. For concrete examples, see 
Section 3.2. Let I(P m ) Q {1, . . . , m} be the indices of the true null hypothe- 
ses, and let I'(P m ) be the indices of the true alternative hypotheses, that 
is, I'(P m ) = {1, . . . ,m} \ I(P m ). Let Po be a distribution under the com- 
plete null hypothesis, that is, I(Po) = {1, • • • , m}. We denote the class of all 
distributions under the complete null hypothesis by TV 

Suppose that the same test is applied for all hypotheses, and let S n C [0, 1] 
be the set of possible p- values this test can take. Thus, S n = [0, 1] for i-tests 
and related approaches, while S n is discrete for permutation tests and rank- 
based tests. Let Pj(W), j = 1, . . . ,m, be the p- values for the m hypotheses, 
based on the chosen test and the data W. 
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2.2. Single-step oracle multiple testing procedure. Suppose that we knew 
the true set of null hypotheses I(P m ) and the distribution of min Jg/ (p m ) Pj(W) 
under P m (which is of course not true in practice). Then we could de- 
fine the following single-step oracle multiple testing procedure: reject Hj 
if Pj(W) < c m:n (a), where c m ,, n (a) is the a-quantile of mhij g /(p m ) PjiW) 
under P m . 

(1) c m ^ n {a) = max|s G Sri ■ Pm 

^ min. Pj(W) < s^j < a j. 

Throughout, we define the maximum of the empty set to be zero, corre- 
sponding to a threshold c m ^ n {a) that leads to zero rejections. 

This oracle procedure controls the FWER at level a, since, by definition, 

P m (Hj is rejected for at least one j G I{P m )) 
= P m ( min Pj{W) < Cm,n{ot) ) < a, 

and it is optimal in the sense that values c G S n with c > c mj „(a) no longer 
control the FWER at level a. 



2.3. Single-step Westf all- Young multiple testing procedure. TheWestfall- 
Young permutation method is based on the idea that under the complete 
null hypothesis, the distribution of W is invariant under a certain group of 
transformations G, that is, for every g €Q, gW and W have the same distri- 
bution under Pq Romano and Wolf [29] refer to this as the "random- 
ization hypothesis." In the sequel, Q is the collection of all permutations g 
of {1, . . . ,n}, so that the number of elements \G\ equals n\. 

Prototype permutation group Q acting on the prototype data matrix W . In 
the examples in Section 3.2, W is a prototype data matrix as described in 
Section 2.1. The prototype permutation g £ Q leads to a matrix gW obtained 
by permuting the first row of W (i.e., permuting the y-variables). For all 
examples in Section 3.2, under the complete null hypothesis Pq G Vq, the 
distribution of gW is then identical to the distribution of W for all g G Q, so 
that the randomization hypothesis is satisfied. We suppress the dependence 
of \G\ on the sample size n for notational simplicity. 



The single-step Westfalf- Young critical value is a random variable, defined 
as follows: 

Cm,n(a) = max! s G S n : _™ m Pj^W) < s\ < a\ 

1 11 9 e6 3 ~ ""' m J 

:{s G S n : P* ( min Pj (W) < s) < a } , 

I \j=l,...,m ) ) 



max^ 
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where 1{-} denotes the indicator function, and P* represents the permuta- 
tion distribution 

(2) P*(f(w) < x) = ±- Hf(gW) < x} 

for any function /(•) mapping W into M. In other words, c m ^ n {a) is the 
a-quantile of the permutation distribution of W3n.j=\ m pj(W) . Our main 
result (Theorem 1) shows that under some conditions, the Westfall- Young 
threshold c m ^ n {a) approaches the oracle threshold c mtn (a). 

It is easy to see that the Westfall- Young permutation method provides 
weak control of the FWER, that is, control of the FWER under the com- 
plete null hypothesis. Under the assumption of subset-pivotality, it also pro- 
vides strong control of the FWER [33], that is, control of the FWER under 
any set I{P m ) of true null hypotheses. Subset-pivotality means that the 
distribution of {pj{W) :j G K} is identical under the restrictions flje-ftr^i 
and C\jei(p ) Hj f° r an possible subsets K C I(P m ) of true null hypotheses. 
Subset-pivotality is not a necessary condition for strong control; see, for ex- 
ample, Romano and Wolf [29], Westfall and Troendle [31] and Goeman and 
Solari [13]. 

3. Asymptotic optimality of Westfall Young. We consider the frame- 
work where the number of hypotheses m tends to infinity. This framework 
is suitable for high-dimensional settings arising, for example, in microarray 
experiments or genome-wide association studies. 

3.1. Assumptions. (Al) Block-independence: the p- values of all true null 
hypotheses adhere to a block-independence structure that is preserved un- 
der permutations in Q. Specifically, there exists a partition A±, . . . , As m of 
{1, . . . , m} such that for any pair of permutations g, g' £ Q, 

min min pAgW), b=l,...,B m , 

g£{g,9'}j£A b nl(P m ) 

are mutually independent under P m . Here, the number of blocks is denoted 
by B = B m . [We assume without loss of generality that A^ n I{P m ) / for 
all b = 1, . . . , B, meaning that there is at least one true null hypothesis in 
each block; otherwise, the condition would be required only for blocks with 

A b ni(P m )^0.\ 

(A2) Sparsity: the number of alternative hypotheses that are true un- 
der P m is small compared to the number of blocks, that is, \I'(P m )\/B m — > 
asm-) oo. 

(A3) Block-size: the maximum size of a block, ms m := m.ax.b=\,...,B m \Ab\, 
is of smaller order than the square root of the number of blocks, that is, 
m B m = o(^B m ) as m — > oo. 
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(Bl) Let G be a random permutation taken uniformly from Q. Under P m , 
the joint distribution of {pj(W) :j G I(P m )} is identical to the joint distri- 
bution of {pj(GW) :j G I(P m )}. ' 

(B2) Let P* be the permutation distribution in (2). There exists a con- 
stant r < oo such that for s = c mj „(a) G S n and all W, 

(3) r- 1 s<P*{p j (W)<s)<rs for all j = 1, . . . ,m. 

(B3) The p-values corresponding to true null hypotheses are uniformly 
distributed; that is, for all j G I(P m ) and s G S n , we have P m (pj(W) < s) = s. 

A sufficient condition for the block-independence assumption (Al) is that 
for every fixed pair of permutations g,g' GQ the blocks of random vari- 
ables {pj(gW),pj(g'W) : j G Af, n I(P m )} are mutually independent for b = 
l,...,B m . This condition is implied by block-independence of the m last 
rows of the prototype W for the examples discussed in Section 3.2 and 
for the prototype Q as in Section 2.3. The block-independence assumption 
captures an essential characteristic of large-scale testing problems: a test 
statistic is often strongly correlated with a number of other test statistics 
but not at all with the remaining tests. 

The sparsity assumption (A2) is appropriate in many contexts. Most 
genome-wide association studies, for example, aim to discover just a few 
locations on the genome that are associated with prevalence of a certain 
disease [20, 23]. Furthermore, assumption (A3) requiring that the range of 
(block-) dependence is not too large, which seems reasonable in genomic 
applications: for example, when having many different groups of genes (e.g., 
pathways), each of them not too large in cardinality, a block-dependence 
structure seems appropriate. 

We now consider assumptions (B1)-(B3), supposing that we work with 
a prototype data matrix W and a prototype permutation group Q as de- 
scribed in Sections 2.1 and 2.3. Assumption (Bl) is satisfied if each p-va- 
lue Pj(W) only depends on the 1st and (j + l)th rows of W. Moreover, 
subset-pivotality is satisfied in this setting. Assumption (B3) is satisfied for 
any test with valid type I error control. Assumption (B2) is fulfilled with 
r = 1 if for all W 

(4) P G ( Pj (GW)<s\W) = s, j = l,...,m,seS n , 

where Pq is the probability with respect to a random permutation G taken 
uniformly from Q, so that the left-hand side of (4) equals P*(pj(W) < s) 
in (3). Note that assumptions (Bl) and (B3) together imply that 

(5) P m!G ( Pj (GW)<s) = s, jeI{P m ),seS n , 

where the probability P m G is with respect to a random draw of the data W, 
and a random permutation G taken uniformly from Q. Thus, assumption (B2) 
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holds if (5) is true for all j = 1, . .. , m when conditioned on the observed 
data. Section 3.2 discusses three concrete examples that satisfy assump- 
tions (B1)-(B3) and subset-pivotality. 

Remark. For our theorems in Section 3.3, it would be sufficient if (3) 
were holding only with probability converging to 1 when sampling a ran- 
dom W, but we leave a deterministic bound since it is easier notationally, 
the extension is direct and we are mostly interested in rank-based and con- 
ditional tests for which the deterministic bound holds. 

3.2. Examples. We now give three examples that satisfy assump- 
tions (B1)-(B3), as well as subset-pivotality. As in Section 2.1, let y be 
a deterministic scalar class variable and X = (Xi, . . . , X m ) an m-dimensional 
vector of random variables, where the distribution of X = X y can depend 
on y. Let the prototype data matrix W and the prototype group of permuta- 
tions Q be defined as in Sections 2.1 and 2.3, respectively. In all examples, we 
work with tests with valid type I error control, and each p- value pj(W) only 
depends on the 1st and (J + l)th rows of W. Hence, assumptions (Bl), (B3) 
and subset-pivotality are satisfied, and we focus on assumption (B2) in the 
remainder. 

For the examples in Sections 3.2.1 and 3.2.2, we assume that there exists 
a n(y) S M m and an m-dimensional random variable Z = (Z±, . . . , Z m ) such 
that 



We omit the dependence of X = X y on y in the following for notational 
simplicity. 

3.2.1. Location-shift models. We consider two-sample testing problems 
for location shifts, similar to Example 5 of Romano and Wolf [29]. Using 
the notation in (6), y G {1,2} is a binary class variable, and the marginal 
distributions of Z are assumed to have a median of zero. 

We are interested in testing the null hypotheses 



We now discuss location-shift tests that satisfy assumption (B2). First, 
note that all permutation tests satisfy (B2) with r = 1, since the p-values 
in a permutation test are defined to fulfill P*(pj(W) < s) = s for all s 6 S n . 
Permutation tests are often recommended in biomedical research [22] and 



(6) 



X = X y = fi(y) + Z. 




m 



m. 
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other large scale location-shift testing applications due to their robustness 
with respect to the underlying distributions. For example, one can use the 
Wilcoxon test. Another example is a "permutation t-test": choose the p- 
value Pj(W) as the proportion of permutations for which the absolute value 
of the t-test statistic is larger than or equal to the observed absolute value 
of the t-test statistic for Hj. Then condition (B2) is fulfilled with r = 1 
with the added advantage that inference is exact, and the type I error is 
guaranteed even if the distributional Gaussian assumption for the t-test 
is not fulfilled [14]. Computationally, such a "permutation t-test" procedure 
seems to involve two rounds of permutations: one for the computation of the 
marginal p- value and one for the Westfall- Young method; see (2). However, 
the computation of the marginal permutation p-value can be inferred from 
the permutations in the Westfall- Young method, as in Meinshausen [25], 
and just a single round of permutations is thus necessary. 

3.2.2. Marginal association. Suppose that we have a continuous vari- 
able y in formula (6). Based on the observed data, we want to test the null 
hypotheses of no association between variable Xj and y, that is, 

Hj : fJ,j(y) is constant in y, j = 1, . . . , m, 

versus the corresponding two-sided alternatives. A special case is the test 
for linear marginal association, where the functions Pj(y) for j = 1, . .. ,m 
are assumed to be of the form Pj(y) = jj + fyy, and the test of no linear 
marginal association is based on the null hypotheses 

H j-Pj = Q, j = !,■■■, m. 

Rank-based correlation test like Spearman's or Kendall's correlation co- 
efficient are examples of tests that fulfill assumption (B2). Alternatively, 
a "permutation correlation-test" could be used, analogous to the "permuta- 
tion t-test" described in Section 3.2.1. 

3.2.3. Contingency tables. Contingency tables are our final example. Let 
y G {1, 2, . . . , Ky} be a class variable with K y distinct values. Likewise, as- 
sume that the random variable X is discrete and that each component of X 
can take K x distinct values, X = (X±, . . . ,X m ) G {1,2,... ,K x } m . 

As an example, in many genome-wide association studies, the variables of 
interest are single nucleotide polymorphisms (SNPs). Each SNP j (denoted 
by Xj) can take three distinct values, in general, and it is of interest to see 
whether there is a relation between the occurrence rate of these categories 
and a category of a person's health status y [5, 15, 20]. 

Based on the observed data, we want to test the null hypothesis for j = 
1, . . . , m that the distribution of Xj does not depend on y, 

Hj : P(Xj = k\y) = P(Xj = k) for all k G {1, ... , K x ] and y G {1, . . . , K y }. 
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The available data for hypothesis Hj is contained in the 1st and (j + l)th 
rows of W. These data can be summarized in a contingency table and 
Fisher's exact test can be used. Since the test is conditional on the marginal 
distributions, we have that P{pj{GW) < s\W) = s for a random permutation 
G € Q and (B2) is fulfilled with r = 1. 

3.3. Main result. We now look at the properties of the Westfall- Young 
permutation method and show asymptotic optimality in the sense that, with 
probability converging to 1 as the number of tests increases, the estimated 
Westfall- Young threshold c mj „(a) is at least as large as the optimal oracle 
threshold c m>n (a — 5), where 5 > can be arbitrarily small. This implies 
that the power of the Westfall- Young permutation method approaches the 
power of the oracle test, while providing strong control of the FWER under 
subset-pivotality [33]. All proofs are given in Section 6. 

Theorem 1. Assume (Al)-(A3) and (B1)-(B3). Then for any a <E 
(0, 1) and any 5 £ (0, a) 

(7) P m {cm, n {ct) > c m<n (a - 5)} ->■ 1 asm^-oo. 

We note that the sample size n can be fixed and does not need to tend 
to infinity. However, if the range of p- values S n is discrete, the sample size 
must increase with m to avoid a trivial result where the oracle threshold 
Cm,n((x ~ 5) vanishes; see also Theorem 2 where this is made explicit for the 
Wilcoxon test in the location-shift model of Section 3.2.1. 

Theorem 1 implies that the actual level of the Westfall- Young procedure 
converges to the desired level (up to possible discretization effects; see Sec- 
tion 3.4). To appreciate the statement in Theorem 1 in terms of power gain, 
consider a simple example. Assume that the m hypotheses form B blocks. 
In the most extreme scenario, test statistics are perfectly dependent within 
each block. In such a scenario, the oracle threshold (1) for each individual 
p-value is then 

1 — y/l — a, 

which is larger than, but very closely approximated by a/B for large val- 
ues of B. Thus, when controlling the FWER at level a, hypotheses can be 
rejected when their p-values are less than 1 — yl — a and certainly when 
their p- values are less than a/B. However, the value of B and the block- 
dependence structure between hypotheses are unknown in practice. With 
a Bonferroni correction for the FWER at level a, hypotheses can be rejected 
when their p- values are less than a/m. If m S> B, the power loss compared to 
the procedure with the oracle threshold is substantial, since the Bonferroni 
method is really controlling at an effective level of size aB/m instead of a. 
Theorem 1, in contrast, implies that the effective level under the Westf all- 
Young procedure converges to the desired level (up to possible discretization 
effects). 
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3.4. Discretization effects with Wilcoxon test. We showed in the last sec- 
tion that the Westfall- Young permutation method is asymptotically equiva- 
lent to the oracle threshold under the made assumptions. In this section we 
look in more detail at the difference between the nominal and effective levels 
of the oracle multiple testing procedure. Controlling at nominal level a, the 
effective oracle level is defined as 

(8) a_ = P m { min Pj(W) < c m , n (a)\ . 

^jei(P m ) > 

By definition, a_ is less than or equal to a. We now examine under which 
assumptions the effective level a_ can be replaced by the nominal level a. 
As a concrete example, we work with the following assumptions: 

(W) The test is a two-sample Wilcoxon test with equal sample sizes 
n\ = H2 = n/2, applied to a location-shift model as defined in Section 3.2.1. 

(A3') Block-size: the maximum size of a block satisfies mg = O(l) as 
m — > oo. 

The restriction to equal sample sizes in (W) is only for technical simplicity. 
We then obtain the following result about the discretization error. 

Theorem 2. Assume (W). Then the oracle critical value c mj „(a) is 
strictly positive when 

n > 21og 2 (fra/a) + 2. 

When assuming in addition (Al), (A2) and (A3') ; then the results of The- 
orem 1 hold, and, for any a € (0, 1), we have 

a_ — > a 

as m, n — > oo such that nj log(m) — > oo. 

The first result in Theorem 2 says that the oracle critical value for the 
test defined in (W) is nontrivial, even when the number of tests grows al- 
most exponentially with sample size. Hence, in this setting the result from 
Theorem 1 still applies in a nontrivial way. 

The second result in Theorem 2 gives sufficient criteria for the effective 
oracle level a_ to converge to a. It is conceivable that this result can also be 
obtained under a milder assumption than (A3'), but this requires a detailed 
study of the Wilcoxon p- values, and we leave this for future work. The main 
takeaway message is that discreteness of the p-values does not change the 
optimality result fundamentally. 

4. Empirical results. The power of the Westfall- Young procedure has 
already been examined empirically in several studies. Westfall et al. [34] 
includes a comparison with the Bonferroni-Holm method, reporting a gain 
in power when using the Westfall- Young procedure. Its focus is on "genetic 
effects in association studies," including genotype (SNP-type) analysis and 
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also gene expression microarray analysis. Becker and Knapp [1] apply the 
Westfall- Young permutation procedure and report substantial gain in power 
over Bonferroni correction in the context of haplotype analysis. Yekutieli and 
Benjamini [37] and Reiner et al. [27] discuss the gain of resampling in terms 
of power, although their focus is mainly on FDR controlling methods. Also 
Dudoit et al. [8] and Ge et al. [11] report that resampling-based methods 
such as the Westfall- Young permutation procedure have clear advantages in 
terms of power. 

Here, we look at a few simulated examples to study the finite-sample 
properties and compare with the asymptotic results of Theorem 1. Data 
for a two-sample location-shift model as in Section 3.2.1 with m hypothe- 
ses and equal sample sizes of 50 are generated from a multivariate Gaus- 
sian distribution with unit variances and (i) a Toeplitz correlation matrix 
with correlations pij = pi' - -? for some p € {0.95, 0.975, 0.99} and 1 < i,j < m 
and (ii) a block covariance model, where correlations within all blocks (of 
size 50 each) are set to the same p G {0.6,0.75,0.9} and to outside of 
each block. Ten alternative hypotheses are picked at random from the first 
100 components by applying a shift of 0.75 whereas all remaining m — 10 
null- hypotheses correspond to no shift. 

We use a two-sided Wilcoxon test. The power over 250 simulations of 
the single-step and step-down methods of the Bonferroni-correction, the or- 
acle procedure and the Westfall- Young permutation procedure are shown 
in Figure 1 for level a = 0.05. The step-down version of the Bonferroni cor- 
rection is the Bonferroni-Holm procedure [19] and the step-down version 
of the Westfall- Young procedure is given in Westfall and Young [33]. The 
oracle threshold (both single-step and step-down) is approximated on a sep- 
arate set of 1000 simulations, and the Westfall- Young method is using 1000 
permutations for each simulation. 

The following main results emerge: the Westfall- Young method is very 
close in power to the oracle procedure for all values of m in the block 
model, giving support to the asymptotic results of Theorem 1. Moreover, 
the Westfall- Young and the oracle procedures are also very similar in the 
Toeplitz model, indicating that Theorem 1 may be generalized beyond block- 
independence models. The power gains of the Westfall- Young procedure, 
compared to Bonferroni-Holm, are substantial; between 20% and 250% for 
the considered scenarios, where the largest gains are achieved in settings 
with a large number of hypotheses and high correlations. Finally, the dif- 
ference between step-down and single-step methods is very small in these 
sparse high-dimensional settings for all three multiple testing methods. 

It might be unexpected that the power of the Westfall- Young is slightly 
larger than the oracle procedure for two settings with very high correlations. 
This is due to the finite number of simulations when approximating the 
oracle threshold, a finite number of permutations in the Westfall- Young 
procedure and a finite number of simulation runs. The family-wise error rate 
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Fig. 1. Average number of true positives for the Toeplitz model (top row) and the block 
model (bottom row). The number of hypotheses is increasing from m = 100 (left panel) to 
rn = 10,000 (right panel) and the correlation parameter is varied as p = 0.95, 0.975, 0.99 
in the Toeplitz model and p = 0.6, 0.75, 0.9 in the block model. The results are shown for 
the Bonferroni correction (single-step and corresponding step-down Bonferroni-Holm; light 
color); the oracle procedure (single-step and step-down; gray color) and the West) all-Young 
permutation procedure (single-step and step-down; dark color). 



is between 0.03 and 0.04 for both oracle and Westfall— Young procedures and 
below 0.02 for the Bonferroni correction in the Toeplitz model. The nominal 
level of a = 0.05 is exceeded sometimes in the block model (again for the 
reason that we use only a finite number of simulations) , where both the oracle 
and Westfall- Young procedures attain a family-wise error rate between 0.04 
and 0.07 in all settings. 
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The computational cost of the Westfall- Young procedures scales approxi- 
mately linearly with the number m of hypotheses. When using 1000 permu- 
tations, computing the Westfall- Young threshold takes about 1.4 seconds 
per hypotheses on a 3 GHz CPU. For the largest setting of m = 10,000 hy- 
potheses, the threshold could thus be computed in just over 2 minutes. It 
seems as if this computational cost is acceptable, even for very large-scale 
testing problems. 

5. Discussion. We considered asymptotic optimality of large-scale multi- 
ple testing under dependence within a nonparametric framework. We showed 
that, under certain assumptions, the Westfall- Young permutation method is 
asymptotically optimal in the following sense: with probability converging to 
1 as the number of tests increases, the Westfall- Young critical value for mul- 
tiple testing at nominal level a is greater than or equal to the unknown oracle 
threshold at level a — 5 for any 5 > 0. This implies that the actual level of 
the Westfall- Young procedure converges to the effective oracle level a_. To 
investigate the possible impact of discrete p- values, we studied a specific ex- 
ample and provided sufficient conditions that ensure that a_ converges to a. 

We gave several examples that satisfy subset-pivotality and our assump- 
tions (B1)-(B3) [while assumptions (Al)-(A3) are about the unknown data- 
generating distribution] . Most of these examples involve rank-based or per- 
mutation tests. These tests are appropriate for very high-dimensional testing 
problems. If the number of tests is in the thousands or even millions, extreme 
tail probabilities are required to claim significance, and these tail probabili- 
ties are more trustworthy under a nonparametric than a parametric test. 

If the hypotheses are strongly dependent, the gain in power of the Westfall- 
Young method compared to a simple Bonferroni correction can be very sub- 
stantial. This is a well known, empirical fact, and we have established here 
that this improvement is also optimal in the asymptotic framework we con- 
sidered. 

Our theoretical results could be expanded to include step-down proce- 
dures like Bonferroni-Holm [19] and the step-down Westfall- Young method 
[11, 33]. The distinction between single-step and step-down procedures is 
very marginal though in our sparse high-dimensional framework, as reported 
in Section 4, since the number of rejected hypotheses is orders of magnitudes 
smaller than the total number of hypotheses. 

6. Proofs. After introducing some additional notation in Section 6.1, 
the proof of Theorem 1 is in Section 6.2 and the proof of Theorem 2 is in 
Section 6.3. 

6.1. Additional notation. Let p^{W) be the minimum p- value over all 
true null hypotheses in the 6th block: 

p^(W)= min pj(W), b = l,...,B, 
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and let 717, (c) denote the probability under P m that p( b \W) is less than or 
equal to a constant c G [0, 1], that is, 

Mc)=Pm(p (b) (W)<c), b=l,...,B. 

Throughout, we denote the expected value, the variance and the covariance 
under P m by E m , Var m and Cov m , respectively. 

6.2. Proof of Theorem 1. Let a' G (0, 1) and 5' G (0, a'). Let <5 = S'/2 and 
a = a' — 5' . Then writing expression (7) in terms of a' and 6' is equivalent 
to 

Pm{Cm,n{a + 25) > C rn , n (a)} ->■ 1 as 77J ^ OO. 

By definition, 

Cm,n(ct + 25) = max js £ S n :P*( min pj (W ) < s ) < a + 2<j) . 
We thus have to show that 

(9) P m <P*l min pj (W)<c m , n {a)) <a + 25 ^1 
as m — > oo. 

First, we show in Lemma 1 that there exists an M < oo such that 
P* ( min p 3 {W) < c m , n (a)) < P* ( min Pj (W) < c m , n (a)) + 6 

\je{l,...,m} / \?'6l(Pm) ' 

for all m > M and for all W. This result is mainly due to the sparsity 
assumption (A2). Second, we show in Lemma 2 that 

(10) P m {p*( mm pj(W) <Cm >n (a)j <a + 5^ 1 form^oo. 

Theorem 1 follows by combining these two results. 

Lemma 1. Let a G (0,1), 5g (0, a), and assume (Al), (A2), (B2) 
and (B3) . T/ien there exists an M < oo sitc/i £/ia£ 

P* ( min Pj (W) < c m , n (a)) < P* ( min Pj (W) < c m , n (a)) + 5 
\je{i,...,m} / Ve/(Pm) ' 

/or aZZ m> M and for all W . 
Proof. Note that 

c m,n(oO £ fin by definition. Using the union bound, 
we have, for all s G S n and all W, 



P* min p,-(W) < s < P* min p 7 -(W) < s 
Ve{i,...,m} / Ve/(Pm) 

+ £ P*(Pj(W)<a) 



OPTIMALITY OF THE WESTFALL-YOUNG PERMUTATION PROCEDURE 15 



Hence, we only need to show that there exists an M < oo such that 

(12) ]T P*( Pj (W)<c m , n ( a ))<5 

jei'(P m ) 

for all m > M and all W. By assumption (B2) with constant r, 
P*(Pj(W) < c m , n (a)) <\l'(P m )\ 

jei'(P m ) 

(13) 

= r^Bc m , n (a). 

Since \I'(P m )\/B — > as m — > oo by assumption (A2), and Sc m>n (a) is 
bounded above by — log(l — a) under assumptions (Al) and (B3) (see Lem- 
ma 3), we can choose a M < oo such that the right-hand side of (13) is 
bounded above by 5 for all m > M. This proves the claim in (12) and com- 
pletes the proof. □ 

Lemma 2. Let a > and 5 > and assume (Al), (A3) and (B1)-(B3). 
Then 

Pm\P*\ min Pi(W) < Cm n (a) ) < a + S > — )■ 1 as m->oo. 

^ W(Pm) / J 

Proof. Let e > 0. The statement in the lemma is equivalent to showing 
that there exists an M < oo such that 

(14) Pm{P*( min Pj (W) > c m>n (a)) < 1 - a - j} < e 

k VjG-f(Pm) / J 

for all m > M. By definition, 

P*( min >c m n (a)) = -!- V]l{ min p j (gW)>c mn {a)\ 

\jei(p m ) > L J' e/ ( p ™) J 

(15) 
where 

il(^,W):=l{ min Pj(jgW) > c m>n (a)\. 

(We suppress the dependence on m,n,P m and a for notational simplicity.) 

Let G be a random permutation, chosen uniformly in Q, and let 1 denote 
the identity permutation. Then, by assumption (Bl), it follows that 

Em (tg\ ^ r( - 9, w) ) = Em > cR ( G > w ^ = E mR(hW). 

Vl 'see ' 
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By definition of c min (a) [see (1)], 



E m R(l,W)=P m ( min Pj {W) > c m , n {a)) > 1 - a. 

Hence, the desired result (14) follows from a Markov inequality as soon as 
one can show that the variance of (15) vanishes as m — > oo, that is, if 

(16) Var m ( |1 R(9, W)j = j-^ ]T Cov m (R(g, W),R(g', W)) = o(l) 

Vl 1 g&G J 11 lj g,g'£G 

as m — > oo. 

Let G, G' be two random permutations, drawn independently and uni- 
formly from Q. Then 

Cov m>G , GI (R(G,W),R{G',W)) = j^ Yl Cov m (R(g,W),R{g',W)). 

Hence, in order to show (16), we only need to show that 

Cov m , G , G ,{R(G,W),R{G',W))=o(l) for m -> oo. 

Define 

(17) R b (g, W) := l{p {b) (gW) > c m , n (a)}, 

so that R(g, W) := n^=i W). We then need to prove that, asm-) oo, 

(18) E m>G>G r ^JJ i?&(G, W)R b (G', W) \ — ^E m ,G (^Y\^^-b(G, W)^j \ =o(l). 
Using assumption (Al), the left-hand side in (18) can be written as 

B B 

H E m ^ GI {R b {G, W)R b (G', W)} - H[E m , G {R b (G, W)}} 2 . 

6=1 6=1 

Note that E m> Q t Q'{R b (G, W)R b (G', W)} and [E m>G {R b (G, W)}] 2 are boun- 
ded between and 1. For sequences of numbers ai,...,ae and &i,...,&b 
that are bounded between and 1, the following inequality holds: 

B 



B B 



E{(^-6i)(n 6 *) (n 

.7=1 ^ V fc<7 7 V fc>7 



<J^|aj-6j| 



n%-n & 

j=i i=i 

Hence, in order to show (18) it is sufficient to show that 

max \E m)GjG ,{R b (G,W)R b (G', W)} - [E m , G {R b (G,W)}} 2 \ 

6=1,. ..,B 

(19) 

asm-) oo. 

Conditional on W, 

Rb(G, W),Rb(G', W) i ~ d - Bernoulli (/i 6 (W)), 
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where fJ- b (W) is the proportion of all permutations g £ Q for which R b (g, W) = 
1 or, equivalently, 

(20) fi b (W) = P m>G {p^(GW) > c m , n (a)\W} = P*{p^ b \W) > c m , n (a)}. 

Thus, the random proportion ^ b {W) is a function of W. Denote its distri- 
bution by F b . Using Lemma 4, the support of F b is contained in the interval 
[1 — log{l/(l — «)}ar 2 mBB _1 ,l] under assumptions (Al), (Bl) and (B2). 
Hence, using Lemma 5, it follows that 

< E m , GyG ,{R b (G,W)R b (G',W)} - [E m , G {R b (G, W)}] 2 
< (log{l/(l - a)}ar 2 m B B- l f. 
Since mg = o(y/B) under assumption (A3), claim (19) follows. □ 

Lemma 3. Under assumptions (Al) and (B3), we have 

B 

(21) Bc m>n {a) <J2*b(cmA a )) <MV(l-4 

6=1 

Proof. Let b € {1, ... , B} and j b e I(P m ) n A b . Then 

(22) vr fe {c min (a)} > P m (p jb (W) < c m<n (a)) = c mj „(a), 

where the inequality follows from the definition of 7r&(-), and the equality 
follows from assumption (B3) and the fact that c min (a) € S* n . Summing (22) 
over b = 1, . . . , B yields the first inequality of (21). 

To prove the second inequality of (21), note that assumption (Al) and 
the definition of c mi „(a) imply that 

B 

(23) 1- J][l-^{c m , n (a)}] <a. 

6=1 

The maximum of Yl b =i 5r i{ c mfi( a )} under constraint (23) is obtained when 

^l{c m ,n{a)} = ■■■ = 1TB{Cm,n{oi)}. 

This implies TT b {c mtn (a)} < 1 — (1 — a) 1 ^ for all 6 = 1, ... ,B, so that 

B 

J> b {c m , n (a)} <B — B(l - a) 1/B , 

6=1 

and this is bounded above by — log(l — a) for all values of B. □ 

Lemma 4. Assume (Al), (Bl) and (B2). Let F b be the distribution 
of n b (W), where (J, b (W) is defined in (20). Then 

support(F 6 ) C [1 - log{l/(l - ajJa^mjjB" 1 , 1]. 
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Proof. Using assumption (B2) with constant r and the union bound, 
it holds that 

1 - fj, b (W) = P*{ P W(W) < c m , n {a)} < r\A b \ 

Cm,n (d) • 

Since tub = max^i # the support of F b is thus in the interval [1 — 
m B rc m , n (a),i\. 

Hence, the proof is complete if we show that 

(24) c m , n {a)<-\og{l-a)arB- 1 . 
To see that (24) holds, we first show that 

(25) l-a<P m { min Pj (W) > c m , n (a)\ < (1 - c m , n (a)/r) B 

The first inequality in (25) follows directly from the definition of c min (a); 
see (1). To prove the second inequality, note that assumption (Al) implies 
that 

B 

(26) P m < min pj (W) > c m , n (a) } = TT P m {p^(W) > c m , n (a)}. 

By assumption (Bl) and the law of iterated expectations, 
P m {p (b) (W) > c m , n (a)} = P m , G {p {b \GW) > c m , n (a)} 

(27) 

= E m {P m>G {pW(GW) > c m , n (a)\W}}. 
By assumption (B2), the conditional probability within each block satisfies 

P m , G {p [b \GW) > c m , n (a)\W} = P*{p^(W) > c m , n (a)} 
(28) <l-P*{pj b (W)< ^(a)} 

< 1 - c mi „(a)/r, 

where % S I(P m ) H A b . Since the right-hand side of (28) does not depend 
on W, the same bound holds for (27), where we also take the expectation 
over W. Using this result in (26), the second inequality in (25) follows. 
Finally, (25) implies 

c m ,n(«) < r{l - (1 - a) 1 ' 3 }. 

Since B>( 1 — (1 — a) l l B ) < — log(l — a) for all values of B, it follows that 1 — 
(l-a) 1 /- 8 < -log^l-a)^" 1 . This proves (24) and completes the proof. □ 

Lemma 5. Let U be a real-valued random variable with support [a,b] C 
[0, 1]. Suppose that the distribution of the two random variables X\ and Xi, 
conditional on U = u, is given by 

X±,X2 1 ~ Bernoulli (u). 
Then < E(X 1 X 2 ) - E(X 1 )E(X 2 ) <(b- a) 2 . 
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Proof. By the assumption that X\ and X 2 are Bernoulli conditional 
on U, it follows that E(X 1 \U) = E(X 2 \U) = U. Combining this with the 
law of iterated expectation and the fact that X\ and X2 are conditionally 
independent given U, we obtain 

E(X 1 X 2 ) = E U {E(X 1 X 2 \U)} = E u {E(X l \U)E(X 2 \U)} = E(U 2 ). 

Moreover, we have E(X l ) = E U {E{X 1 \U)} = E(U) and similarly E(X 2 ) = 
E(U). Hence, 

£(XiX 2 ) - E{X{)E{X 2 ) = E(U 2 ) - {E(U)} 2 = Var(C/). 
Finally, < Var({7) < (b — a) 2 by the assumption on the support of U. □ 

6.3. Proof of Theorem 2. First, note that (W) implies (B1)-(B3). Using 
the union bound and assumption (B3), it holds for any s & S n that ms is an 
upper bound for P m (mmj £l (p m )Pj(W) < s). Hence, 

Cm,n{oi) = maxjs G S n : P m ( min Pj(W) < s\ < a\ 

(29) ° &I[Pm) 
> maxjs G S n : ms < a}. 

This implies that the oracle critical value is larger than zero if the set {s G 
S n - ms < a } is nonempty, which is the case if mm(S n ) < a/m. The small- 
est possible two-sided Wilcoxon p- value is min(5 n ) = 2 ^"^ 2 ^, n ^ 2 ^ ! < 2 _n / 2+1 . 
Hence, it is sufficient to require that 2~ n / 2+1 < a/m, or equivalently, that 
n > 21og 2 (m/a) + 2. 

Note that (A3') implies (A3). Hence, under assumptions (W), (Al), (A2) 
and (A3'), the result in Theorem 1 applies. 

Let a G (0,1). We will now show that under assumptions (W), (Al) 
and (A3'), 

a_ — > a 

as m,n 00 such that n/log(m) —> 00, where a_ was defined in (8). De- 
fine c^ n (a) := min{s G S n :s > c m>n (a)}. Using the definition of a_ and 
assumption (Al), we have 

ct-=P m \ min pj(W) < c m , n (a) ) 
v je/(Pm) ' 

B 

(30) =1- ~ MCm.,n(u)}} 

6=1 
B 

= 1 - JJ[1 - vr fe {c+ n (a)} + Tr b {c+ n (a)} - 7ii,{c rn>n (a)}]. 
6=1 
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Define the function g m ,n-Y[b=i[^^b{cm,n{ a )}] K by 

g m ,n(u) ■= g m ,n(ui, . . .,Ub) 
B 

:= 1 - JJ[1 - vr fe {c+ jn (a)} + u b ], 

6=1 

so that the right-hand side of (30) equals g m ,n(w), where Wj, := vrb{c+ n (a)} — 
^b{c m ,n(oi)} for b = 1, ... ,B. A first-order Taylor expansion of gm,n(w) around 
(0,...',0) yields 

dg m ,n( U ) 



(31) Q- = g m ,n(w) = gm,n(0) + ^ 



duh 

b=i 

where R = o(^^ =1 Wf,). For all b = 1, . . . , B, we have 
dgm,n{u) 



+ R, 

u=0 



du h 



=- n [i-^-iO)}] 

l-5m,n(0) l-9m,n(0) 



1 - 7r b {c^ )n (a)} l-m jB Cm,n(a) 

where the inequality follows from 7T6{c+ n (a)} < msc+ n (a) for 6 = 1, . . . , B, 
by the union bound and assumption (B3). Plugging this into (31) yields 

ot- > g m ,n{°) - --, + / \ y w b + R 

1 - m B Cm,n(a) ^ 

(32) 

- 3m,n(UJ 1 + + , x - + ; s + ^- 

V 1 - m B Cm,n{a)J l-m B Cm,n{a) 

The definition of c+ n (a) implies that gm,n{ty > ol for all m and n. Hence, if 

B 

(33) ^^u;;,— >0 and m^c^ n (a) — >• 

6=1 

as m,n oo such that n/log(m) —> oo, then the right-hand side of (32) 
converges to a and the jproof is complete. 

We first consider J2b=i w b- By definition, there is no value s' G S n such 
that c m:n (a) < s' < c+ n (a). Hence, 

^6 = P m { . min Vj(W) = c+ (a)} 

< m B . max P m { Pj (W) = c+ (a)} 
3 aA b nl(P m ) 
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where the inequality follows from the union bound, and the last equality is 
due to assumption (B3). This implies 



^ + /c+ n (a) 

} w b < Bm B {4,„ a: ~ c mj „(a)} = Bc m ^{a)m B — '— — - 1 
^ ' \CmA a ) 

Similarly, we have 



B Cm, n {a) 



Note that Bc m ^ n {a) < log{l/(l — a)} by Lemma 3 and ra B = 0(1) (and 
hence B — > oo) by assumption (A3'). Hence, in order to prove (33), it suffices 
to show that 

(34) n (a)/c m>n (a) — > 1 as m,n — > oo,n/log(m) — > oo. 

Let the ordered p- values in S n , based on a two-sided Wilcoxon test with 
equal sample sizes n/2 in both classes, be denoted by sq < s\ < ■ ■ ■ < s rn , 
where r n = [n 2 /8 + lj . It is well known that 

(n/2)l(n/2)l ^ for ; = , . . . ,r n - 1 

3=0 

and s rn = 1, where q n (j) is the number of integer partitions of j such that nei- 
ther the number of parts nor the part magnitudes exceed n [and q n (0) = 1] [35] . 
Let i m>n satisfy s im n =c mj „(a). Then 

Cm,n(a) _ E}=0 +1 gn/2(j) 
C m ,n(«)" jy 3 =0Qn/2(j) ' 

This ratio converges to 1 if i m ^ n — > oo. Recall that c m ^ n (a) > max{s G S n : s < 
a/m} [see (29)]. Hence, 

, , , (n/2)\(n/2)\ , . 

4i,n( a ) = 2 n! 2^ (ln/2(j)>a/m. 

n ' i=o 

Since 2m{(n/2)!(n/2)!}/n! < m2~ n l 2 — >■ as m,n — > oo such that n/ 
log(m) — > oo, we have that under these conditions i m ^ n — > oo and c+ n (a)/ 
Cm,n(a) — >• 1- Thus (34) holds and hence implies (33), which completes the 
proof. 
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